Here, we’ll be modeling housing price based on house characteristics using data from SLO home sales.
homes <- read_csv("slo_homes.csv") %>%
clean_names() %>%
filter(city %in% c("San Luis Obispo", "Atascadero", "Arroyo Grande"))
## Parsed with column specification:
## cols(
## City = col_character(),
## Price = col_double(),
## Bedrooms = col_double(),
## Bathrooms = col_double(),
## SqFt = col_double(),
## PricePerSqFt = col_double(),
## Status = col_character()
## )
beep(10) # 1-12 sounds
praise() # random praise check: name
## [1] "You are super-duper!"
praise("You are totally ${adjective}! Super ${EXCLAMATION}!")
## [1] "You are totally ultimate! Super HMM!"
Are there correlations betwee variables that we’d consider while trying to model home price?
Some exploring: look at correlations between numeric variables
homes_cor <- cor(homes[2:5])
homes_cor
## price bedrooms bathrooms sq_ft
## price 1.0000000 0.3115258 0.4991453 0.6618575
## bedrooms 0.3115258 1.0000000 0.7519186 0.6960233
## bathrooms 0.4991453 0.7519186 1.0000000 0.8046112
## sq_ft 0.6618575 0.6960233 0.8046112 1.0000000
corrplot(homes_cor,
method = "ellipse", # could also do "circle"
type = "upper")
# Moderate correlations between numeric variables (no concern about multicollinearity here from a correlation value standpoint, but there may be some conceptual overlap between variables)
praise()
## [1] "You are impeccable!"
# See: names(praise_parts) for other things you can call in praise
# For example, customize it a bit
praise("You are totally ${adjective}! Super ${EXCLAMATION}!")
## [1] "You are totally cat's meow! Super AHHH!"
Are there reasons to believe this should be a multiple linear model linear regression actual makes sense?
ggplot(data = homes, aes(x = sq_ft, y = price))+
geom_point()
ggplot(data = homes, aes(x = bedrooms, y = price))+
geom_point()
But let’s start out with a complete model(includes city, bedrooms, bathrooms, sq_ft, ) using all variables we have:
# price model as a function (~) of city, bedrooms, bathrooms, sq_ft and status
homes_lm <- lm(price ~ city + bedrooms + bathrooms + sq_ft + status, data = homes)
# View it:
homes_lm
##
## Call:
## lm(formula = price ~ city + bedrooms + bathrooms + sq_ft + status,
## data = homes)
##
## Coefficients:
## (Intercept) cityAtascadero citySan Luis Obispo
## 184130.9 -167396.4 31018.1
## bedrooms bathrooms sq_ft
## -161645.5 48692.0 389.1
## statusRegular statusShort Sale
## 303964.6 -19828.6
summary(homes_lm)
##
## Call:
## lm(formula = price ~ city + bedrooms + bathrooms + sq_ft + status,
## data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -843938 -115963 -12298 109718 3445077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 184130.93 143368.84 1.284 0.20157
## cityAtascadero -167396.39 78701.95 -2.127 0.03552 *
## citySan Luis Obispo 31018.14 114895.10 0.270 0.78766
## bedrooms -161645.51 49414.32 -3.271 0.00141 **
## bathrooms 48692.04 52408.63 0.929 0.35476
## sq_ft 389.12 53.17 7.318 3.43e-11 ***
## statusRegular 303964.64 105942.81 2.869 0.00488 **
## statusShort Sale -19828.56 86335.35 -0.230 0.81875
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 380600 on 117 degrees of freedom
## Multiple R-squared: 0.5682, Adjusted R-squared: 0.5424
## F-statistic: 22 on 7 and 117 DF, p-value: < 2.2e-16
# give you the equation of the model
Price = 184130 - 167396(cityAtascadero) + 31018.14(citySan Luis Obispo)-161645.51(bedrooms)
Well that’s kind of a nightmare to look at. (bedrooms, bathrooms and suqre footage are all indicators about how big the house is) And putting it into a table could be really challenging. Now lets try another version of the model Just using sq_ft as amuasure of home size
homes_lm2 <- lm(price ~ city + sq_ft + status, data = homes)
summary(homes_lm2)
##
## Call:
## lm(formula = price ~ city + sq_ft + status, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -653118 -120914 -8844 101402 3644738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -105820.71 118697.40 -0.892 0.3745
## cityAtascadero -182586.90 80683.44 -2.263 0.0254 *
## citySan Luis Obispo 70278.97 115846.31 0.607 0.5452
## sq_ft 325.03 31.54 10.306 <2e-16 ***
## statusRegular 315441.26 109749.65 2.874 0.0048 **
## statusShort Sale 31284.44 87356.11 0.358 0.7209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 395100 on 119 degrees of freedom
## Multiple R-squared: 0.5268, Adjusted R-squared: 0.5069
## F-statistic: 26.49 on 5 and 119 DF, p-value: < 2.2e-16
AIC
AIC(homes_lm)
## [1] 3576.834
AIC(homes_lm2)
## [1] 3584.3
# we want the lower value of AIC.
# But the model making sense should be the most important thing
# adding the complecity comes more useful
We believe the model 2 is the most sound model conceptionally
Enter, the stargazer package.
stargazer(homes_lm, type = "html")
| Dependent variable: | |
| price | |
| cityAtascadero | -167,396.400** |
| (78,701.950) | |
| citySan Luis Obispo | 31,018.140 |
| (114,895.100) | |
| bedrooms | -161,645.500*** |
| (49,414.320) | |
| bathrooms | 48,692.040 |
| (52,408.630) | |
| sq_ft | 389.120*** |
| (53.171) | |
| statusRegular | 303,964.600*** |
| (105,942.800) | |
| statusShort Sale | -19,828.560 |
| (86,335.350) | |
| Constant | 184,130.900 |
| (143,368.800) | |
| Observations | 125 |
| R2 | 0.568 |
| Adjusted R2 | 0.542 |
| Residual Std. Error | 380,586.300 (df = 117) |
| F Statistic | 21.997*** (df = 7; 117) |
| Note: | p<0.1; p<0.05; p<0.01 |
Let’s answer a few questions:
Try another version of the model:
homes_lm2 <- lm(price ~ city + sq_ft + status, data = homes)
And check out the results:
stargazer(homes_lm2, type = "html") # And you can customize...
| Dependent variable: | |
| price | |
| cityAtascadero | -182,586.900** |
| (80,683.440) | |
| citySan Luis Obispo | 70,278.970 |
| (115,846.300) | |
| sq_ft | 325.028*** |
| (31.538) | |
| statusRegular | 315,441.300*** |
| (109,749.700) | |
| statusShort Sale | 31,284.440 |
| (87,356.110) | |
| Constant | -105,820.700 |
| (118,697.400) | |
| Observations | 125 |
| R2 | 0.527 |
| Adjusted R2 | 0.507 |
| Residual Std. Error | 395,084.400 (df = 119) |
| F Statistic | 26.491*** (df = 5; 119) |
| Note: | p<0.1; p<0.05; p<0.01 |
You can also use stargazer for multiple model comparisons:
stargazer(homes_lm, homes_lm2, type = "html")
| Dependent variable: | ||
| price | ||
| (1) | (2) | |
| cityAtascadero | -167,396.400** | -182,586.900** |
| (78,701.950) | (80,683.440) | |
| citySan Luis Obispo | 31,018.140 | 70,278.970 |
| (114,895.100) | (115,846.300) | |
| bedrooms | -161,645.500*** | |
| (49,414.320) | ||
| bathrooms | 48,692.040 | |
| (52,408.630) | ||
| sq_ft | 389.120*** | 325.028*** |
| (53.171) | (31.538) | |
| statusRegular | 303,964.600*** | 315,441.300*** |
| (105,942.800) | (109,749.700) | |
| statusShort Sale | -19,828.560 | 31,284.440 |
| (86,335.350) | (87,356.110) | |
| Constant | 184,130.900 | -105,820.700 |
| (143,368.800) | (118,697.400) | |
| Observations | 125 | 125 |
| R2 | 0.568 | 0.527 |
| Adjusted R2 | 0.542 | 0.507 |
| Residual Std. Error | 380,586.300 (df = 117) | 395,084.400 (df = 119) |
| F Statistic | 21.997*** (df = 7; 117) | 26.491*** (df = 5; 119) |
| Note: | p<0.1; p<0.05; p<0.01 | |
Now, check assumptions for normality nad homoscedasticity We can use the diagnostic plots to check assumptions about residuals, e.g.:
plot(homes_lm2)
# yep looks like
# for the qq plot,the data is almost perfectly normal distribution. the point 121 has higher price than our model predicts, it might because we miss some information, nor necessarily mean our model went wrong
# Residuals vs Leverage
plot(homes_lm)
Make a nice regression table:
stargazer(homes_lm2, type = "html")
| Dependent variable: | |
| price | |
| cityAtascadero | -182,586.900** |
| (80,683.440) | |
| citySan Luis Obispo | 70,278.970 |
| (115,846.300) | |
| sq_ft | 325.028*** |
| (31.538) | |
| statusRegular | 315,441.300*** |
| (109,749.700) | |
| statusShort Sale | 31,284.440 |
| (87,356.110) | |
| Constant | -105,820.700 |
| (118,697.400) | |
| Observations | 125 |
| R2 | 0.527 |
| Adjusted R2 | 0.507 |
| Residual Std. Error | 395,084.400 (df = 119) |
| F Statistic | 26.491*** (df = 5; 119) |
| Note: | p<0.1; p<0.05; p<0.01 |
First, we’ll make a new data frame containing all variables that home_lm2 needs to make a prediction:
Make sure that the variables we create for the new data match the variables that the model will be looking for to make new predictions.
new_df <- data.frame(
city = rep(c("San Luis Obispo", "Arroyo Grande", "Atascadero"), each = 10),
sq_ft = rep(seq(1000, 5000, length = 10)),
status = "Regular"
)
# rep: repet function rep(names, repet_times); sequence function seq()
# rep("Regular", 30) = "Regular"
new_df
## city sq_ft status
## 1 San Luis Obispo 1000.000 Regular
## 2 San Luis Obispo 1444.444 Regular
## 3 San Luis Obispo 1888.889 Regular
## 4 San Luis Obispo 2333.333 Regular
## 5 San Luis Obispo 2777.778 Regular
## 6 San Luis Obispo 3222.222 Regular
## 7 San Luis Obispo 3666.667 Regular
## 8 San Luis Obispo 4111.111 Regular
## 9 San Luis Obispo 4555.556 Regular
## 10 San Luis Obispo 5000.000 Regular
## 11 Arroyo Grande 1000.000 Regular
## 12 Arroyo Grande 1444.444 Regular
## 13 Arroyo Grande 1888.889 Regular
## 14 Arroyo Grande 2333.333 Regular
## 15 Arroyo Grande 2777.778 Regular
## 16 Arroyo Grande 3222.222 Regular
## 17 Arroyo Grande 3666.667 Regular
## 18 Arroyo Grande 4111.111 Regular
## 19 Arroyo Grande 4555.556 Regular
## 20 Arroyo Grande 5000.000 Regular
## 21 Atascadero 1000.000 Regular
## 22 Atascadero 1444.444 Regular
## 23 Atascadero 1888.889 Regular
## 24 Atascadero 2333.333 Regular
## 25 Atascadero 2777.778 Regular
## 26 Atascadero 3222.222 Regular
## 27 Atascadero 3666.667 Regular
## 28 Atascadero 4111.111 Regular
## 29 Atascadero 4555.556 Regular
## 30 Atascadero 5000.000 Regular
Then, use the predict() function to find the predicted home prices for each combination in new_df:
# Make the predictions using new_df:
# predict(my actual model name, the values I want to precit)
predict_df <- predict(homes_lm2, newdata = new_df)
# Then bind predictions together with new_df:
full_df <- data.frame(new_df, predict_df)
full_df
## city sq_ft status predict_df
## 1 San Luis Obispo 1000.000 Regular 604927.5
## 2 San Luis Obispo 1444.444 Regular 749384.3
## 3 San Luis Obispo 1888.889 Regular 893841.2
## 4 San Luis Obispo 2333.333 Regular 1038298.1
## 5 San Luis Obispo 2777.778 Regular 1182754.9
## 6 San Luis Obispo 3222.222 Regular 1327211.8
## 7 San Luis Obispo 3666.667 Regular 1471668.7
## 8 San Luis Obispo 4111.111 Regular 1616125.5
## 9 San Luis Obispo 4555.556 Regular 1760582.4
## 10 San Luis Obispo 5000.000 Regular 1905039.3
## 11 Arroyo Grande 1000.000 Regular 534648.5
## 12 Arroyo Grande 1444.444 Regular 679105.4
## 13 Arroyo Grande 1888.889 Regular 823562.2
## 14 Arroyo Grande 2333.333 Regular 968019.1
## 15 Arroyo Grande 2777.778 Regular 1112476.0
## 16 Arroyo Grande 3222.222 Regular 1256932.8
## 17 Arroyo Grande 3666.667 Regular 1401389.7
## 18 Arroyo Grande 4111.111 Regular 1545846.6
## 19 Arroyo Grande 4555.556 Regular 1690303.4
## 20 Arroyo Grande 5000.000 Regular 1834760.3
## 21 Atascadero 1000.000 Regular 352061.6
## 22 Atascadero 1444.444 Regular 496518.5
## 23 Atascadero 1888.889 Regular 640975.3
## 24 Atascadero 2333.333 Regular 785432.2
## 25 Atascadero 2777.778 Regular 929889.1
## 26 Atascadero 3222.222 Regular 1074345.9
## 27 Atascadero 3666.667 Regular 1218802.8
## 28 Atascadero 4111.111 Regular 1363259.7
## 29 Atascadero 4555.556 Regular 1507716.5
## 30 Atascadero 5000.000 Regular 1652173.4
# point style - pch
ggplot() +
geom_point(data = homes,
aes(x = sq_ft,
y = price,
color = city,
pch = city),
size = 1,
alpha = 0.5) +
geom_line(data = full_df,
aes(x = sq_ft,
y = predict_df,
color = city)) +
scale_color_manual(values = c("orange", "magenta", "black")) +
theme_light()
…but statistics are not substitute for judgement!
AIC(homes_lm)
## [1] 3576.834
AIC(homes_lm2)
## [1] 3584.3
# Pretty close, but the first model has a lower AIC. Should I pick the first model just based on AIC? NO!
ca_border <- read_sf(here::here("ca_state_border"), layer = "CA_State_TIGER2016")
plot(ca_border)
## Warning: plotting the first 9 out of 14 attributes; use max.plot = 14 to plot
## all
Then plot them together with ggplot2!
ggplot() +
geom_sf(data = dams_sf)
ggplot() +
geom_sf(data = ca_border)
# Combine:
ggplot() +
geom_sf(data = ca_border,
color = "white",
fill = "grey40") +
geom_sf(data = dams_sf,
size = 1,
alpha = 0.5,
color = "orange") +
theme_minimal()
beepr::beep(11)